ABSTRACT

Summary: We present Rust-Bio, the first general purpose
bioinformatics library for the innovative Rust programming language.
Rust-Bio leverages the unique combination of speed, memory safety 
and high-level syntax offered by Rust to provide a fast and safe 
set of bioinformatics algorithms and data structures with a focus on 
sequence analysis. 

Availability: Rust-Bio is available open-source under the MIT license
at https://rust-bio.github.io

Contact: koester@jimmy.harvard.edu 

1 INTRODUCTION 

With ever increasing amounts of experimental data being generated,
their computational analysis becomes increasingly challenging.
For novel or custom problems where carefully engineered
high-performance standalone tools (like read mappers) are not yet
available, general purpose bioinformatics libraries can help to
minimize the coding effort. Bioinformatics libraries are published
for many popular programming languages, e.g., SeqAn for
C++, Biopython, BioPerl and BioRuby (Döring et al., 2008;
Cock et al., 2009; Stajich et al., 2002; Goto et al., 2010). Choosing
the programming language for a specific task usually entails a
tradeoff between execution and development speed. Low-level
system programming languages like C or C++ provide optimal
performance at the cost of increased complexity. Higher level
languages like Python or Perl provide a more concise syntax but
incur computational overhead from online memory
management (e.g. reference counting or garbage collection), runtime
type inference and being interpreted rather than compiled.
Often, combining a high-level language with a bioinformatics library
that provides carefully engineered implementations is a good
choice to quickly solve a problem with reasonable performance.
However, the amounts of data the bioinformatics community is
facing in the coming years and the need to handle nature’s resources
carefully imply that using a high-performance, compiled language
is still beneficial for certain problems.

Recently, Rust* has gained attention as a new programming
language combining speed with memory safety and high-level
syntactical features. Being compiled with LLVM (Lattner and Adve,
2004), Rust has many advantages of low-level, system programming
languages, such as speed and a small memory footprint. Supporting
automatic type inference, its code is often less verbose than C
or C++ code. With Rust, type inference happens at compile time,
such that runtime overhead (appearing with scripting languages
like Python) can be avoided. The key feature of Rust is a
concept of ownership and borrowing of variables that enables the
compiler to determine the lifetime of objects at
compile time, making online memory management superfluous
without requiring manual freeing of resources. At the same time,
this concept prevents common sources of errors in low-level
languages, such as accessing invalid memory regions. Finally, the
ownership concept enforces thread safety, such that data races
cannot occur. These features make Rust a promising solution to
the tradeoff described above.

* http://www.rust-lang.org
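The ownership and borrowing rules can be illustrated with a minimal sketch that uses only the Rust standard library. The function names `gc_count`, `consume` and `parallel_gc_count` are hypothetical and chosen for illustration; they are not part of Rust-Bio.

```rust
use std::thread;

/// Counts G and C bases. The sequence is only borrowed (`&[u8]`), so the
/// caller keeps ownership and no copy is made.
fn gc_count(seq: &[u8]) -> usize {
    seq.iter().filter(|&&b| b == b'G' || b == b'C').count()
}

/// Takes ownership of the sequence. The vector is freed automatically when
/// `seq` goes out of scope at the end of this function: no manual free call
/// and no garbage collector are involved.
fn consume(seq: Vec<u8>) -> usize {
    seq.len()
}

/// Splits the sequence into chunks and counts G/C in parallel. Scoped
/// threads may share immutable borrows of `seq`; an attempt to mutate it
/// from several threads without synchronization would not compile.
fn parallel_gc_count(seq: &[u8], chunk_size: usize) -> usize {
    thread::scope(|s| {
        let handles: Vec<_> = seq
            .chunks(chunk_size.max(1)) // guard against a chunk size of zero
            .map(|chunk| s.spawn(move || gc_count(chunk)))
            .collect();
        handles
            .into_iter()
            .map(|h| h.join().expect("worker thread panicked"))
            .sum::<usize>()
    })
}
```

After a call such as `consume(reads)`, any further use of `reads` is rejected by the compiler, which is how access to freed memory is ruled out at compile time.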

In this work, we present Rust-Bio, the first general purpose
bioinformatics library for the Rust programming language.
Rust-Bio provides a high-level, fast and safe API for many state-of-the-art
data structures and algorithms used in bioinformatics.

2 LIBRARY 

Rust-Bio is built with the following principles in mind. Where
possible, iterators are returned. This allows streams of
data to be processed with a minimal memory footprint. In addition, using the extensive
set of iterator tools available in Rust, iterators can be, e.g., filtered,
modified, chained or combined in an easy way. If a language
data type appears suitable, we avoid wrapping data in a custom
object. This minimizes memory usage and increases flexibility
when handling the data: e.g., biological sequences are represented
as vectors or slices of bytes in ASCII encoding. This allows
sequences to be used with all algorithms and functions in, e.g., the Rust standard
library that work with byte vectors or slices. Each implemented
algorithm is automatically tested via continuous integration†. For
each algorithm and data structure, we provide complexities in the
documentation. Where more than one alternative is available, the
documentation tries to highlight distinguishing use cases. So far,
Rust-Bio is focused on algorithms and data structures for biological
sequences. A central component of Rust-Bio is its alphabets, which,
e.g., allow checking in linear time whether a given sequence is a word
over the alphabet, transforming symbols to their lexicographical ranks,
and performing bit-encoding to save memory or iterate over q-grams.


† https://travis-ci.org
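The design principles above can be sketched with the standard library alone. The type `ToyAlphabet` and the function `revcomp` below are hypothetical stand-ins written for illustration; they are not the Rust-Bio alphabet API, and the bit-encoding shown is just one plausible scheme.

```rust
/// A toy alphabet: a lookup table from byte to lexicographical rank.
struct ToyAlphabet {
    rank: [Option<u8>; 256],
    size: usize,
}

impl ToyAlphabet {
    fn new(symbols: &[u8]) -> Self {
        let mut sorted = symbols.to_vec();
        sorted.sort_unstable();
        sorted.dedup();
        let mut rank = [None; 256];
        for (r, &b) in sorted.iter().enumerate() {
            rank[b as usize] = Some(r as u8);
        }
        ToyAlphabet { rank, size: sorted.len() }
    }

    /// Linear-time membership check: one table lookup per symbol.
    fn is_word(&self, seq: &[u8]) -> bool {
        seq.iter().all(|&b| self.rank[b as usize].is_some())
    }

    /// Bits needed to encode one symbol: ceil(log2(size)), at least 1.
    fn bits(&self) -> u32 {
        usize::BITS - (self.size.max(2) - 1).leading_zeros()
    }

    /// Lazily iterates over all q-grams of `seq`, each packed into a `u64`
    /// using the symbol ranks. Panics on symbols outside the alphabet, so
    /// callers should validate with `is_word` first.
    fn qgrams<'a>(&'a self, q: usize, seq: &'a [u8]) -> impl Iterator<Item = u64> + 'a {
        let bits = self.bits();
        assert!(q >= 1 && q as u32 * bits <= 64, "q-gram does not fit into 64 bits");
        seq.windows(q).map(move |w| {
            w.iter().fold(0u64, |code, &b| {
                let r = self.rank[b as usize].expect("symbol not in alphabet");
                (code << bits) | u64::from(r)
            })
        })
    }
}

/// Sequences are plain byte slices, so standard iterator adaptors apply
/// directly: here, a reverse complement without any custom sequence type.
fn revcomp(seq: &[u8]) -> Vec<u8> {
    seq.iter()
        .rev()
        .map(|&b| match b {
            b'A' => b'T',
            b'C' => b'G',
            b'G' => b'C',
            b'T' => b'A',
            other => other,
        })
        .collect()
}
```

Because `qgrams` returns an iterator, the encoded q-grams can be filtered, counted or collected by the caller without an intermediate buffer.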
Listing 1. Creating an FM-Index for a given text with an occurrence table
sampling rate of 3. Here, the alphabet is used to provide guarantees for being
able to limit memory usage during FM-Index construction. Afterwards, we
iterate over a FASTQ file, use the alphabet to validate read sequences and
search for exact matches in the FM-Index.

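use bio::alphabets;
use bio::data_structures::bwt::bwt;
use bio::data_structures::fmindex::FMIndex;
use bio::data_structures::suffix_array::suffix_array;
use bio::io::fastq;

// Build the FM-Index: suffix array, then BWT, then the index itself.
let alphabet = alphabets::dna::iupac_alphabet();
let pos = suffix_array(text);
let bwt = bwt(text, &pos);
let fmindex = FMIndex::new(&bwt, 3, &alphabet);

// Validate each read against the alphabet, then search for exact matches.
let reader = fastq::Reader::from_file("reads.fastq");
for record in reader.records() {
    let seq = record.seq();
    if alphabet.is_word(seq) {
        let interval = fmindex.backward_search(seq.iter());
        let positions = interval.occ(&pos);
    }
}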


Rust-Bio can read and write common file formats like FASTA,
FASTQ and BED. For SAM/BAM, CRAM, and VCF/BCF support,
it is complemented by Rust-HTSlib.
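To make the streaming style of such readers concrete, the following is a minimal sketch of a FASTA record iterator built on the standard library. The types `Record` and `FastaRecords` are hypothetical and are not the Rust-Bio reader API; the sketch assumes well-formed input and performs no validation of sequence symbols.

```rust
use std::io::{self, BufRead};

/// One FASTA record: header line (without '>') and concatenated sequence.
#[derive(Debug, PartialEq)]
struct Record {
    id: String,
    seq: Vec<u8>,
}

/// Streaming reader: yields one record at a time instead of loading the file.
struct FastaRecords<R: BufRead> {
    lines: io::Lines<R>,
    next_header: Option<String>,
}

impl<R: BufRead> FastaRecords<R> {
    fn new(reader: R) -> Self {
        FastaRecords { lines: reader.lines(), next_header: None }
    }
}

impl<R: BufRead> Iterator for FastaRecords<R> {
    type Item = io::Result<Record>;

    fn next(&mut self) -> Option<Self::Item> {
        // Find the header: carried over from the previous call, or the
        // next line starting with '>'. Lines before the first header are skipped.
        let id = loop {
            if let Some(h) = self.next_header.take() {
                break h;
            }
            match self.lines.next()? {
                Ok(line) => {
                    if let Some(h) = line.strip_prefix('>') {
                        break h.trim_end().to_string();
                    }
                }
                Err(e) => return Some(Err(e)),
            }
        };
        // Collect sequence lines until the next header or end of input.
        let mut seq = Vec::new();
        for line in self.lines.by_ref() {
            match line {
                Ok(l) => {
                    if let Some(h) = l.strip_prefix('>') {
                        self.next_header = Some(h.trim_end().to_string());
                        break;
                    }
                    seq.extend_from_slice(l.trim_end().as_bytes());
                }
                Err(e) => return Some(Err(e)),
            }
        }
        Some(Ok(Record { id, seq }))
    }
}
```

Any `BufRead` source works, e.g. `FastaRecords::new(BufReader::new(File::open(path)?))` for a file or a byte slice for in-memory data.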

Especially when considering sequencing data, many problems
can be solved with a set of well established data structures like
suffix arrays (Manber and Myers, 1990), the Burrows-Wheeler
Transform (Burrows and Wheeler, 1994), rank/select data structures
(Jacobson, 1989) and q-gram indices. In line with that,
Rust-Bio implements induced sorting for suffix array construction
(Nong et al., 2009), the FM-Index (Ferragina and Manzini, 2000)
for pattern matching on top of the Burrows-Wheeler Transform,
a practical variant of a rank/select data structure (Gonzalez et al.,
2005) and a q-gram index for arbitrary alphabets and q < 32.
Further, Rust-Bio implements the FMD-Index (Li, 2012), which
allows finding supermaximal exact matches in DNA sequences and
their reverse complements in linear time.
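How suffix array, Burrows-Wheeler Transform and FM-Index fit together can be shown with a deliberately naive sketch. It is not the Rust-Bio implementation: the suffix array is built by plain sorting rather than induced sorting, occurrence counts are obtained by a linear scan instead of a sampled table, and all names (`suffix_array`, `bwt`, `ToyFmIndex`, `find`) are illustrative. The text is assumed to end with a unique sentinel `$` that is smaller than all other symbols.

```rust
/// Suffix array by plain sorting of suffixes (quadratic in the worst case).
fn suffix_array(text: &[u8]) -> Vec<usize> {
    let mut sa: Vec<usize> = (0..text.len()).collect();
    sa.sort_by(|&a, &b| text[a..].cmp(&text[b..]));
    sa
}

/// BWT: the symbol preceding each suffix, in suffix array order.
fn bwt(text: &[u8], sa: &[usize]) -> Vec<u8> {
    sa.iter()
        .map(|&i| if i == 0 { text[text.len() - 1] } else { text[i - 1] })
        .collect()
}

/// Unsampled toy FM-Index over a BWT.
struct ToyFmIndex {
    bwt: Vec<u8>,
    /// less[c] = number of symbols in the text that are smaller than c.
    less: [usize; 256],
}

impl ToyFmIndex {
    fn new(bwt: Vec<u8>) -> Self {
        let mut counts = [0usize; 256];
        for &b in &bwt {
            counts[b as usize] += 1;
        }
        let mut less = [0usize; 256];
        let mut sum = 0;
        for c in 0..256 {
            less[c] = sum;
            sum += counts[c];
        }
        ToyFmIndex { bwt, less }
    }

    /// Occurrences of `c` in bwt[..i], by linear scan.
    fn occ(&self, c: u8, i: usize) -> usize {
        self.bwt[..i].iter().filter(|&&b| b == c).count()
    }

    /// Backward search: half-open suffix array interval [lo, hi) of all
    /// suffixes that start with `pattern`; (0, 0) if there is no match.
    fn backward_search(&self, pattern: &[u8]) -> (usize, usize) {
        let (mut lo, mut hi) = (0, self.bwt.len());
        for &c in pattern.iter().rev() {
            lo = self.less[c as usize] + self.occ(c, lo);
            hi = self.less[c as usize] + self.occ(c, hi);
            if lo >= hi {
                return (0, 0);
            }
        }
        (lo, hi)
    }
}

/// All start positions of `pattern` in `text`, in increasing order.
fn find(text: &[u8], pattern: &[u8]) -> Vec<usize> {
    let sa = suffix_array(text);
    let index = ToyFmIndex::new(bwt(text, &sa));
    let (lo, hi) = index.backward_search(pattern);
    let mut positions = sa[lo..hi].to_vec();
    positions.sort_unstable();
    positions
}
```

The pattern is consumed from right to left, and each step narrows the suffix array interval; the final interval is translated into text positions through the suffix array.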

Implementations for many classical pattern matching algorithms
are provided, including the algorithm of Knuth, Morris and Pratt,
Backward Nondeterministic DAWG Matching, Backward Oracle
Matching, the algorithm of Horspool, and the Shift-And algorithm
(Knuth et al., 1977; Navarro and Raffinot, 1998; Allauzen et al., 1999;
Horspool, 1980; Wu and Manber, 1992). In the supplement, we
compare the speed of these algorithms against the C++ based SeqAn,
which is among the fastest bioinformatics libraries (Döring et al.,
2008). The benchmarks exemplify that the speed of Rust-Bio
is comparable to that of C++ based implementations. For
approximate pattern matching, Ukkonen’s dynamic programming
based algorithm (Ukkonen, 1985) and Myers’ bit-parallel algorithm
(Myers, 1999) are provided. Finally, Rust-Bio implements
local, global and semi-global pairwise sequence alignment as
variants of the Smith-Waterman and Needleman-Wunsch algorithms
(Needleman and Wunsch, 1970; Smith and Waterman, 1981). An
example for using the Rust-Bio API can be seen in Listing 1.
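As an illustration of one of the listed algorithms, the Shift-And idea can be sketched in a few lines of standard Rust. The function `shift_and` is a hypothetical stand-alone version, limited here to patterns of at most 64 symbols so that the automaton state fits into one machine word; it is not the Rust-Bio implementation.

```rust
/// Shift-And exact matching for patterns of 1 to 64 symbols.
/// Returns the start positions of all occurrences of `pattern` in `text`.
fn shift_and(pattern: &[u8], text: &[u8]) -> Vec<usize> {
    let m = pattern.len();
    assert!(m >= 1 && m <= 64, "pattern length must be in 1..=64");

    // Initialization: masks[c] has bit j set iff pattern[j] == c.
    let mut masks = [0u64; 256];
    for (j, &c) in pattern.iter().enumerate() {
        masks[c as usize] |= 1 << j;
    }
    // Bit m-1 set in the state means the whole pattern has been matched.
    let accept = 1u64 << (m - 1);

    let mut state = 0u64;
    let mut hits = Vec::new();
    for (i, &c) in text.iter().enumerate() {
        // Bit j of the state is set iff pattern[..=j] matches the text
        // ending at position i: extend all prefixes by one symbol at once.
        state = ((state << 1) | 1) & masks[c as usize];
        if state & accept != 0 {
            hits.push(i + 1 - m);
        }
    }
    hits
}
```

Each text symbol costs one shift, one OR and one AND, independent of the pattern length, which is why the approach is attractive for short patterns.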


3 CONCLUSION 

Rust-Bio is a general purpose bioinformatics library. Building on
the innovative Rust programming language, Rust-Bio combines
memory safety with speed, complemented by rigorous continuous
integration tests. So far, a wide set of algorithms and data structures
for biological sequences is provided, ranging from index data
structures to pattern matching and alignment, complemented by
readers and writers for common file formats.
Rust-Bio - a fast and safe bioinformatics library:
Supplement

Benchmarks of pattern matching algorithms 

Since Rust-Bio is based on a compiled language, similar performance to C/C++
based libraries can be expected. Indeed, we find the pattern matching
algorithms of Rust-Bio to perform in the range of the C++ library SeqAn*:


Algorithm    Rust-Bio    SeqAn
BNDM         77 ms       80 ms
Horspool     122 ms      125 ms
BOM          103 ms      107 ms
Shift-And    241 ms      545 ms


We measured 10,000 iterations of searching the pattern

GCGCGTACACACCGCCCG

in the sequence of the human MT chromosome (assembly hg38). Initialization
time of each algorithm for the given pattern was included in each iteration.
Benchmarks were conducted with Cargo bench for Rust-Bio and Python timeit
for SeqAn on an Intel Core i5-3427U CPU. Benchmarking SeqAn from Python
timeit entails an overhead of around 1.46 ms for calling a C++ binary. This
overhead was subtracted from the above run times.
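The measurement scheme, with per-pattern setup counted inside every iteration, can be sketched as follows. This is not the benchmark code used for the table: it times a naive search with `std::time::Instant` instead of Cargo bench, and the names `count_matches` and `bench` are hypothetical. Timings obtained this way are not comparable to the numbers above.

```rust
use std::hint::black_box;
use std::time::{Duration, Instant};

/// Naive search used as a stand-in workload: counts occurrences of `pattern`.
fn count_matches(pattern: &[u8], text: &[u8]) -> usize {
    if pattern.is_empty() || pattern.len() > text.len() {
        return 0;
    }
    text.windows(pattern.len()).filter(|w| *w == pattern).count()
}

/// Runs `iterations` complete searches and returns the total elapsed time
/// together with the match count of the last run.
fn bench(pattern: &[u8], text: &[u8], iterations: u32) -> (Duration, usize) {
    let start = Instant::now();
    let mut last = 0;
    for _ in 0..iterations {
        // Any per-pattern initialization belongs here, inside the timed loop,
        // so that it is included in every iteration.
        // `black_box` keeps the compiler from optimizing the search away.
        last = count_matches(black_box(pattern), black_box(text));
    }
    (start.elapsed(), last)
}
```

A fixed calling overhead, such as the one described for invoking a binary from Python, would be subtracted from the returned duration after measuring it separately.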

Note that this benchmark only compares the two libraries to exemplify that
Rust-Bio has comparable speed to C++ libraries: all used algorithms have
advantages depending on text and pattern structures and lengths. Details about
when to use which pattern matching algorithm can be found in the
documentation of Rust-Bio’s pattern matching module.


* http://www.seqan.de